[https://nvbugs/6025177][fix] Fix KV cache issue (cherry-pick to release/1.3.0rc5.post2)#12819
Conversation
…ase/1.3.0rc5.post2) Cherry-pick of NVIDIA#12673 onto release/1.3.0rc5.post2. Fixes KV cache corruption caused by storing blocks with over-counted unique token extent during chunked prefill. Introduces getUsableUniqueTokenCountForReuse() and getMaterializedUniqueTokenCountForReuse() helpers to correctly cap the number of tokens stored for reuse. Also moves simulatePrefillCompletion to a KvCacheManagerTestUtil test utility class to avoid polluting the production interface. Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
📝 Walkthrough

This PR refactors KV cache reuse token counting logic by introducing helper functions that compute accurate usable unique token counts based on context state, replacing hardcoded assumptions. It adds test utilities to simulate prefill completion for unit tests, exposes these utilities via nanobind Python bindings, and updates multiple unit tests to use the new simulation utility.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp (1)
5184-5189: ⚠️ Potential issue | 🟠 Major
Apply the simulated prefill completion to the stored requests, not a copy.

Line 5184 iterates `llmRequests` by value, so Line 5188 only updates a temporary. The vector returned from `fillKvCacheManager(...)` still contains requests with stale prefill state, and the later `removeSequence(...)` calls in `FillKvCacheAndCompleteRequestsTest` will keep exercising the wrong state.

🔧 Suggested fix

```diff
-    for (auto request : llmRequests)
+    for (auto& request : llmRequests)
     {
         kvCacheManager.addSequence(request.mRequestId, request.getPromptLen(), maxBeamWidth, request);
         request.mState = LlmRequestState::kGENERATION_IN_PROGRESS;
         tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(request);
         kvCacheManager.storeContextBlocks(request);
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` around lines 5184 - 5189, The loop over llmRequests is iterating by value so simulatePrefillCompletion and the state change only affect a copy; change the loop to iterate by reference (e.g., for (auto& request : llmRequests)) so kvCacheManager.addSequence, setting request.mState = LlmRequestState::kGENERATION_IN_PROGRESS, tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(request), and kvCacheManager.storeContextBlocks(request) modify the actual entries returned by fillKvCacheManager(...) and later removeSequence(...) will see the updated prefill state in the FillKvCacheAndCompleteRequestsTest.
🧹 Nitpick comments (1)
cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h (1)
29-40: Narrow this helper's contract.

It only calls `setContextCurrentPosition(getPromptLen())`; `LlmRequestState` and the rest of the post-prefill bookkeeping stay unchanged. Either drive the full post-prefill transition here or rename/doc it as a materialization-only shim so tests do not assume stronger behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h` around lines 29 - 40, The helper simulatePrefillCompletion only sets llmRequest.setContextCurrentPosition(llmRequest.getPromptLen()) but leaves LlmRequestState and other post-prefill bookkeeping unchanged; either make it perform the full post-prefill transition (update LlmRequestState and any other post-prefill flags/members) or narrow its contract by renaming and documenting it as a materialization-only shim. Specifically, either implement the full post-prefill steps on batch_manager::LlmRequest (matching what production prefill completion does) inside simulatePrefillCompletion, or rename the function (e.g., simulatePrefillMaterialization) and update its comment to state it only calls setContextCurrentPosition/getPromptLen and does NOT modify LlmRequestState or other bookkeeping, and update all tests to use the new name/expectations.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp`:
- Around line 403-406: The call to
tensorrt_llm::testing::KvCacheManagerTestUtil::simulatePrefillCompletion(*)
inside the if (llmReq->getContextRemainingLength() == 0) branch is a no-op
because remaining==0 already implies llmReq->getContextCurrentPosition() ==
llmReq->getPromptLen(); to actually exercise the materialized-prefix path either
remove the simulatePrefillCompletion() call from this branch or change the test
setup/condition so simulatePrefillCompletion() runs when
contextRemainingLength() > 0 (or set llmReq so getContextCurrentPosition() <
getPromptLen() before calling simulatePrefillCompletion()), then call
kvCacheManager->storeContextBlocks(*llmReq) — update the condition or request
initialization accordingly to ensure simulatePrefillCompletion() meaningfully
mutates the request state.
In `@cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp`:
- Line 34: Update the NVIDIA copyright header in the modified file
kvCacheManagerTest.cpp to reflect 2026 (the latest meaningful modification
year); locate the file header at the top of kvCacheManagerTest.cpp (the test
file that includes "tensorrt_llm/testing/kvCacheManagerTestUtil.h") and change
the year range/end year to 2026 so the header is current.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0855f8a5-dac7-4461-8c53-cb3568dacef1
📒 Files selected for processing (11)
- cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp
- cpp/tensorrt_llm/nanobind/CMakeLists.txt
- cpp/tensorrt_llm/nanobind/bindings.cpp
- cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.cpp
- cpp/tensorrt_llm/nanobind/testing/kvCacheManagerTestUtilBinding.h
- cpp/tensorrt_llm/testing/kvCacheManagerTestUtil.h
- cpp/tests/unit_tests/batch_manager/capacitySchedulerTest.cpp
- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp
- tests/unittest/_torch/executor/test_resource_manager.py
- tests/unittest/llmapi/test_llm_kv_cache_events.py
Force-pushed: f30dbf2 to 20131a8
/bot run --disable-fail-fast
PR_Github #42198 [ run ] triggered by Bot. Commit:
PR_Github #42198 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #42322 [ run ] triggered by Bot. Commit:
PR_Github #42322 [ run ] completed with state
Signed-off-by: thorjohnsen <41591019+thorjohnsen@users.noreply.github.com>
/bot run --disable-fail-fast
PR_Github #42365 [ run ] triggered by Bot. Commit:
/bot kill
…ntamination_release_1.3.0rc5.post2
PR_Github #42389 [ kill ] triggered by Bot. Commit:
PR_Github #42389 [ kill ] completed with state
/bot run --disable-fail-fast
PR_Github #42797 [ run ] triggered by Bot. Commit:
PR_Github #42797 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #42829 [ run ] triggered by Bot. Commit:
PR_Github #42829 [ run ] completed with state
Summary
- Cherry-pick onto release/1.3.0rc5.post2

Test plan

- cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp
- cpp/tests/unit_tests/multi_gpu/cacheTransceiverTest.cpp

🤖 Generated with Claude Code
Summary by CodeRabbit
Bug Fixes
Tests